Title-Block Based Web Page Reorganization

Author

  • Qihua Chen
Abstract

For cell phone users and for blind people using non-visual browsers, browsing the Web with common browsers is quite inefficient because of information overload. This paper presents the TB-WPRO (Title-Block based Web Page Re-Organization) method, which hierarchically segments web pages into blocks using visual and layout information that reflects the web designers' intent. TB-WPRO segments web pages with a clear goal: to extract self-described title blocks. To reorganize a web page, the segmentation result is transformed into a series of small web pages that can be easily accessed. Compared with current methods, the proposed approach obtains a promising segmentation result in which blocks are visually and semantically consistent with the original web pages.

DOI: 10.4018/japuc.2011010107. International Journal of Advanced Pervasive and Ubiquitous Computing, 3(1), 55-62, January-March 2011. Copyright © 2011, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

[...] inefficiency. Therefore, for these applications, analysis and reorganization of the web page become necessary. For web pages containing a main text (e.g., a news story), the problem is relatively easy, since the main story can be segmented and extracted according to text percentage and area. However, for web pages that have no main text and serve as "index pages" or "hub pages" (e.g., the home pages of most websites), analysis and reorganization remain a major challenge. Earlier methods merely extract all the text from a web page, reorganize it, and either display it on a small screen or transform it into speech for non-visual browsing. Obviously, these methods do not solve the problem of information overload, and they make browsing inefficient. More recently, much research effort has gone into segmenting a web page into small blocks and transforming the page into a sequence of blocks.
However, whether they rely on text information or on visual information, current methods still segment unsatisfactorily: the blocks they produce are often not visually and semantically consistent with the original web pages. Furthermore, current methods merely segment web pages into blocks and let users skip between blocks, which is still an inefficient way to browse.

In this paper, the TB-WPRO (title-block based web page reorganization) method is proposed. The method segments web pages from the designer's perspective using both visual information and page layout, and it is mainly intended for the "index pages" or "hub pages" introduced above. The main idea is to extract title blocks from web pages and reorganize them hierarchically. A title block is a block with a title that describes the category of the block, together with main content belonging to that category. Compared with current methods, the proposed approach can obtain a promising segmentation result in which blocks are visually and semantically consistent with the original web pages. Furthermore, because it considers only title blocks, the proposed method can filter out less important content such as navigation bars and some advertisements, which helps in dealing with the challenge of information extraction.

The rest of this paper is organized as follows: Section 2 provides a brief review of related work. Section 3 and Section 4 present the page segmentation method and the content reorganization method in TB-WPRO, respectively. Experimental results are given and analyzed in Section 5. Finally, conclusions are drawn in Section 6.
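The idea of a title block and of turning a segmented page into a series of small pages can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `TitleBlock` class, the `to_small_pages` function, the page dictionary layout, and the sample data are all hypothetical names chosen for illustration, and the sketch assumes the segmentation step has already produced a hierarchy of title blocks.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TitleBlock:
    """Hypothetical model: a title naming a category, plus content in that category."""
    title: str
    content: list[str] = field(default_factory=list)         # e.g. headlines or link texts
    children: list[TitleBlock] = field(default_factory=list)  # nested title blocks


def to_small_pages(block: TitleBlock, path: tuple[str, ...] = ()) -> list[dict]:
    """Flatten a title-block hierarchy into small pages, parent before children.

    Each small page carries the block's own content and the titles of its
    child blocks, which act as navigation entries to the next level.
    """
    here = path + (block.title,)
    page = {
        "path": " > ".join(here),
        "items": list(block.content),
        "links": [child.title for child in block.children],
    }
    pages = [page]
    for child in block.children:
        pages.extend(to_small_pages(child, here))
    return pages


# Made-up example hierarchy standing in for a segmented index page.
home = TitleBlock("Home", children=[
    TitleBlock("News", content=["Headline A", "Headline B"]),
    TitleBlock("Sports", content=["Match report"], children=[
        TitleBlock("Football", content=["League table"]),
    ]),
])
pages = to_small_pages(home)
for p in pages:
    print(p["path"], p["items"], p["links"])
```

In this sketch a user on a small screen or a non-visual browser would first meet only the top-level titles and then descend into one category at a time, instead of reading the whole index page linearly.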

Similar resources

TB-WPRO: Title-Block Based Web Page Reorganization

For cell phone users and blind people using non-visual browsers, browsing Web by common browsers is quite inefficient due to the problem of information overload. This paper presents the TB-WPRO (Title-Block based Web Page Re-Organization) method, which hierarchically segments web pages into blocks using visual and layout information reflecting the web designers’ intent. TB-WPRO segments the web ...

A Web Page Segmentation Method by using Headlines to Web Contents as Separators and its Evaluations

In this paper, we describe a Web page segmentation method based on title blocks and show its evaluation. Title blocks are minimum blocks that function as headlines for specific Web content. A typical Web page consists of multiple elements with different types of features, such as main content, navigation panels, copyright and privacy notices, and advertisements. Web page segmentation is the div...

Using Document Structure on Retrieving Webpages at the Web-CLEF 2006

We present a report on our participation in the mixed monolingual web task of the 2006 Cross-Language Evaluation Forum (CLEF). We compared the result of web page retrieval based on the page content, page title, and anchor page. The retrieval effectiveness for the combination of page content, page title, and anchor texts was better than that of the combination of page title and page title only. ...

Using Web Page Titles to Rediscover Lost Web Pages

Titles are denoted by the TITLE element within a web page. We queried the title against the Yahoo search engine to determine the page’s status (found, not found). We conducted several tests based on elements of the title. These tests were used to discern whether we could predict a page’s status based on the title. Our results increase our ability to determine bad titles but not our ability t...

Publication date: 2015